processors may solve a variety of challenging computational
problems. This tutorial
Numerous advances have been made in developing intelligent
systems, some inspired by biological neural networks.
Researchers from many scientific disciplines are designing
artificial neural networks (ANNs) to solve a variety of problems in pattern
recognition, prediction, optimization, associative memory, and control
(see the “Challenging problems” sidebar).

Conventional approaches have been proposed for solving these prob- 
lems. Although successful applications can be found in certain well-con- 
strained environments, none is flexible enough to perform well outside 
its domain. ANNs provide exciting alternatives, and many applications
could benefit from using them.

This article is for those readers with little or no knowledge of ANNs to 
help them understand the other articles in this issue of Computer. We dis- 
cuss the motivations behind the development of ANNs, describe the basic 
biological neuron and the artificial computational model, outline net- 
work architectures and learning processes, and present some of the most 
commonly used ANN models. We conclude with character recognition, a 
successful ANN application. 





WHY ARTIFICIAL NEURAL NETWORKS? 

The long course of evolution has given the human brain many desirable
characteristics not present in von Neumann or modern parallel
computers. These include


* massive parallelism,
* distributed representation and computation,
* learning ability,
* generalization ability,
* adaptivity,
* inherent contextual information processing,
* fault tolerance, and
* low energy consumption.


It is hoped that devices based on biological neural networks will possess
some of these desirable characteristics.


Modern digital computers outperform humans in the domain of 
numeric computation and related symbol manipulation. However, 
humans can effortlessly solve complex perceptual problems (like recog- 
nizing a man in a crowd from a mere glimpse of his face) at such a high 
speed and extent as to dwarf the world’s fastest computer. Why is there 
such a remarkable difference in their performance? The biological neural 
system architecture is completely different from the von Neumann archi- 
tecture (see Table 1). This difference significantly affects the type of func- 
tions each computational model can best perform. 

Numerous efforts to develop “intelligent” programs based on von 
Neumann’s centralized architecture have not resulted in general-purpose 
intelligent programs. Inspired by biological neural networks, ANNs are 
massively parallel computing systems consisting of an extremely large
number of simple processors with many interconnections. ANN models attempt
to use some “organizational” principles believed to be used in the human 


March 1996 










... there are no training data with known class ...

... the similarity ...








Function approximation
Suppose a set of n labeled training patterns (input-output
pairs), {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, have been generated
from an unknown function (subject to noise). The task
of function approximation is to find an estimate of the unknown function.


..., y(t_n) in a time ...
... the task is to predict the sample ...
... future time ...







... a completely different ...
can be retrieved. Associative memory or content-addressable
memory, as the name implies, can be ...

... even by a partial ... memory is desirable in ...
multimedia information databases.






Computer 



























Control 
Consider a dynamic system define...
where u(t) is the control input and ...
put of the system at time t. In ...
control, the goal is to generat...
the system follows a desired ...
reference model. An example ...
(Figure A7).


[Figure A panel labels: cardiogram, pattern classifier, abnormal; over-fitting to noisy training data, true function; airplane partially occluded by clouds, associative memory, retrieved airplane; load torque, controller.]

Figure A. Tasks that neural networks ...: (1) pattern classification; (2) cl...; (3) function approximation; (4) ...; (5) optimization (a TSP prob...); (6) ... by content; and (7) control (...). (... from DARPA Neural Network Study.)






















brain. Modeling a biological nervous system using ANNs 
can also increase our understanding of biological functions. 
State-of-the-art computer hardware technology (such as 
VLSI and optical) has made this modeling feasible. 

A thorough study of ANNs requires knowledge of neu- 
rophysiology, cognitive science/psychology, physics (sta- 
tistical mechanics), control theory, computer science, 
artificial intelligence, statistics/mathematics, pattern 
recognition, computer vision, parallel processing, and 
hardware (digital/analog/VLSI/optical). New develop- 
ments in these disciplines continuously nourish the field. 
On the other hand, ANNs also provide an impetus to these 
disciplines in the form of new tools and representations. 
This symbiosis is necessary for the vitality of neural net- 
work research. Communications among these disciplines 
ought to be encouraged. 


Brief historical review 

ANN research has experienced three periods of extensive
activity. The first peak in the 1940s was due to
McCulloch and Pitts’ pioneering work.4 The second
occurred in the 1960s with Rosenblatt’s perceptron convergence
theorem5 and Minsky and Papert’s work showing
the limitations of a simple perceptron.6 Minsky and
Papert’s results dampened the enthusiasm of most
researchers, especially those in the computer science community.
The resulting lull in neural network research
lasted almost 20 years. Since the early 1980s, ANNs have
received considerable renewed interest. The major developments
behind this resurgence include Hopfield’s energy
approach7 in 1982 and the back-propagation learning
algorithm for multilayer perceptrons (multilayer feed-forward
networks) first proposed by Werbos,8 reinvented
several times, and then popularized by Rumelhart et al.9
in 1986. Anderson and Rosenfeld10 provide a detailed historical
account of ANN developments.


Biological neural networks 
A neuron (or nerve cell) is a special biological cell that
processes information (see Figure 1). It is composed of a 
cell body, or soma, and two types of out-reaching tree-like 
branches: the axon and the dendrites. The cell body has a 
nucleus that contains information about hereditary traits 
and a plasma that holds the molecular equipment for pro- 
ducing material needed by the neuron. A neuron receives 
signals (impulses) from other neurons through its dendrites 
(receivers) and transmits signals generated by its cell body 
along the axon (transmitter), which eventually branches 
into strands and substrands. At the terminals of these 
strands are the synapses. A synapse is an elementary struc- 
ture and functional unit between two neurons (an axon 
strand of one neuron and a dendrite of another). When the 
impulse reaches the synapse’s terminal, certain chemicals 
called neurotransmitters are released. The neurotransmit- 
ters diffuse across the synaptic gap, to enhance or inhibit, 
depending on the type of the synapse, the receptor neuron’s 
own tendency to emit electrical impulses. The synapse’s 
effectiveness can be adjusted by the signals passing through 
it so that the synapses can learn from the activities in which 
they participate. This dependence on history acts as a mem- 
ory, which is possibly responsible for human memory. 

Figure 1. A sketch of a biological neuron.

The cerebral cortex in humans is a large flat sheet of neurons
about 2 to 3 millimeters thick with a surface area of
about 2,200 cm^2, about twice the area of a standard computer
keyboard. The cerebral cortex contains about 10^11
neurons, which is approximately the number of stars in the
Milky Way.11 Neurons are massively connected, much more
complex and dense than telephone networks. Each neuron
is connected to 10^3 to 10^4 other neurons. In total, the human
brain contains approximately 10^14 to 10^15 interconnections.
Neurons communicate through a very short train of 
pulses, typically milliseconds in duration. The message is 
modulated on the pulse-transmission frequency. This fre- 
quency can vary from a few to several hundred hertz, which 
is a million times slower than the fastest switching speed in 
electronic circuits. However, complex perceptual decisions 
such as face recognition are typically made by humans 
within a few hundred milliseconds. These decisions are 
made by a network of neurons whose operational speed is 
only a few milliseconds. This implies that the computations 
cannot take more than about 100 serial stages. In other
words, the brain runs parallel programs that are about 100
steps long for such perceptual tasks. This is known as the
hundred step rule.12 The same timing considerations show
that the amount of information sent from one neuron to
another must be very small (a few bits). This implies that
critical information is not transmitted directly, but captured
and distributed in the interconnections; hence the name,
connectionist model, used to describe ANNs.

Figure 2. McCulloch-Pitts model of a neuron.

Interested readers can find more introductory and easily
comprehensible material on biological neurons and
neural networks in Brunak and Lautrup.11


ANN OVERVIEW 


Computational models of neurons 
McCulloch and Pitts4 proposed a binary threshold unit
as a computational model for an artificial neuron (see
Figure 2).

This mathematical neuron computes a weighted sum of
its n input signals, $x_j$, j = 1, 2, ..., n, and generates an output
of 1 if this sum is above a certain threshold u.
Otherwise, an output of 0 results. Mathematically,

    $y = \theta\left( \sum_{j=1}^{n} w_j x_j - u \right),$

where $\theta(\cdot)$ is a unit step function at 0, and $w_j$ is the synapse
weight associated with the jth input. For simplicity of notation,
we often consider the threshold u as another weight
$w_0 = -u$ attached to the neuron with a constant input $x_0 = 1$.
Positive weights correspond to excitatory synapses,
while negative weights model inhibitory ones. McCulloch
and Pitts proved that, in principle, suitably chosen weights
let a synchronous arrangement of such neurons perform
universal computations. There is a crude analogy here to
a biological neuron: wires and interconnections model
axons and dendrites, connection weights represent
synapses, and the threshold function approximates the
activity in a soma. The McCulloch and Pitts model, however,
contains a number of simplifying assumptions that
do not reflect the true behavior of biological neurons.
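As a minimal sketch of the binary threshold unit described above, the following Python function computes the weighted sum and applies the unit step. The function names and the hand-picked weights and thresholds for the AND, OR, and NOT gates are illustrative assumptions, not values from the article.

```python
def mcculloch_pitts(inputs, weights, threshold):
    """Return 1 if the weighted input sum exceeds the threshold u, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum - threshold > 0 else 0


# Illustrative gate settings (assumed values, chosen by hand).
def logic_and(a, b):
    return mcculloch_pitts((a, b), (1, 1), 1.5)


def logic_or(a, b):
    return mcculloch_pitts((a, b), (1, 1), 0.5)


def logic_not(a):
    # A negative (inhibitory) weight flips the input.
    return mcculloch_pitts((a,), (-1,), -0.5)
```

Combining such gates is one way to see why suitably chosen weights allow networks of these units to compute Boolean functions.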
The McCulloch-Pitts neuron has been generalized in
many ways. An obvious one is to use activation functions
other than the threshold function, such as piecewise linear,
sigmoid, or Gaussian, as shown in Figure 3. The sigmoid
function is by far the most frequently used in ANNs.
It is a strictly increasing function that exhibits smoothness
and has the desired asymptotic properties.
The standard sigmoid function is the logistic
function, defined by

    $g(x) = \frac{1}{1 + \exp\{-\beta x\}},$

where $\beta$ is the slope parameter.
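The four activation types named above can be sketched in a few lines of Python. Only the logistic form follows the definition in the text; the exact shapes used here for the piecewise linear and Gaussian functions (a ramp clipped to [0, 1] and a unit-height bell with width `sigma`) are assumptions made for illustration.

```python
import math


def threshold(v):
    """Unit step at 0."""
    return 1.0 if v > 0 else 0.0


def piecewise_linear(v):
    """Assumed form: a ramp centered at 0, clipped to the range [0, 1]."""
    return min(1.0, max(0.0, v + 0.5))


def logistic(v, beta=1.0):
    """Logistic sigmoid with slope parameter beta."""
    return 1.0 / (1.0 + math.exp(-beta * v))


def gaussian(v, sigma=1.0):
    """Assumed form: unit-height Gaussian with width sigma."""
    return math.exp(-(v * v) / (2.0 * sigma * sigma))
```

A larger `beta` makes the logistic function steeper, so it approaches the threshold function as `beta` grows.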


Network architectures 
ANNs can be viewed as weighted directed 
graphs in which artificial neurons are 
nodes and directed edges (with weights) 
are connections between neuron outputs 
and neuron inputs.

Based on the connection pattern (architecture), ANNs
can be grouped into two categories (see Figure 4):


* feed-forward networks, in which graphs have no
  loops, and
* recurrent (or feedback) networks, in which loops
  occur because of feedback connections.


In the most common family of feed-forward networks, 
called multilayer perceptron, neurons are organized into 
layers that have unidirectional connections between them. 
Figure 4 also shows typical networks for each category. 

Different connectivities yield different network behav- 
iors. Generally speaking, feed-forward networks are sta- 
tic, that is, they produce only one set of output values 
rather than a sequence of values from a given input. Feed- 
forward networks are memory-less in the sense that their 
response to an input is independent of the previous net- 
work state. Recurrent, or feedback, networks, on the other 
hand, are dynamic systems. When a new input pattern is 
presented, the neuron outputs are computed. Because of 
the feedback paths, the inputs to each neuron are then
modified, which leads the network to enter a new state. 
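Because the two categories differ only in whether the weighted directed graph contains a loop, the distinction can be sketched as a cycle check. The graph representation (a dictionary from each node to its successors) and the function names below are assumptions chosen for this illustration.

```python
def has_loop(edges):
    """Depth-first search for a directed cycle.

    edges maps each node to a list of the nodes it feeds into.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    nodes = set(edges) | {m for targets in edges.values() for m in targets}
    color = {n: WHITE for n in nodes}

    def visit(n):
        color[n] = GRAY                      # n is on the current path
        for m in edges.get(n, ()):
            if color[m] == GRAY:             # edge back into the path: a loop
                return True
            if color[m] == WHITE and visit(m):
                return True
        color[n] = BLACK                     # fully explored, no loop through n
        return False

    return any(color[n] == WHITE and visit(n) for n in nodes)


def classify_architecture(edges):
    return "recurrent" if has_loop(edges) else "feed-forward"
```

For example, a graph with edges from inputs to hidden units to an output is reported as feed-forward, and adding an edge from the output back to a hidden unit makes it recurrent.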

Different network architectures require appropriate 
learning algorithms. The next section provides an 
overview of learning processes. 


Learning 

The ability to learn is a fundamental trait of intelligence. 
Although a precise definition of learning is difficult to for- 
mulate, a learning process in the ANN context can be 
viewed as the problem of updating network architecture 
and connection weights so that a network can efficiently 
perform a specific task. The network usually must learn 
the connection weights from available training patterns. 
Performance is improved over time by iteratively updating
the weights in the network. ANNs’ ability to automatically
learn from examples makes them attractive and
exciting. Instead of following a set of rules specified by
human experts, ANNs appear to learn underlying rules
(like input-output relationships) from the given collection
of representative examples. This is one of the major
advantages of neural networks over traditional expert systems.
To understand or design a learning process, you must
first have a model of the environment in which a neural
network operates, that is, you must know what information
is available to the network. We refer to this model as
a learning paradigm. Second, you must understand how
network weights are updated, that is, which learning rules
govern the updating process. A learning algorithm refers
to a procedure in which learning rules are used for adjusting
the weights.

Figure 3. Different types of activation functions: (a) threshold, (b) piecewise linear, (c) sigmoid, and (d) Gaussian.

Figure 4. A taxonomy of feed-forward and recurrent/feedback network architectures.

There are three main learning paradigms: supervised, 
unsupervised, and hybrid. In supervised learning, or 
learning with a “teacher,” the network is provided with a 
correct answer (output) for every input pattern. Weights 
are determined to allow the network to produce answers
as close as possible to the known correct answers.
Reinforcement learning is a variant of supervised learning
in which the network is provided with only a critique
on the correctness of network outputs, not the correct 
answers themselves. In contrast, unsupervised learning, or 
learning without a teacher, does not require a correct 
answer associated with each input pattern in the training 
data set. It explores the underlying structure in the data, 
or correlations between patterns in the data, and orga- 
nizes patterns into categories from these correlations. 
Hybrid learning combines supervised and unsupervised 
learning. Part of the weights are usually determined 
through supervised learning, while the others are 
obtained through unsupervised learning. 

Learning theory must address three fundamental and 
practical issues associated with learning from samples: 
capacity, sample complexity, and computational complexity.
Capacity concerns how many patterns can be
stored, and what functions and decision boundaries a network
can form.


Sample complexity determines the number of training
patterns needed to train the network to guarantee a valid 
generalization. Too few patterns may cause “over-fitting” 
(wherein the network performs well on the training data 
set, but poorly on independent test patterns drawn from the 
same distribution as the training patterns, as in Figure A3). 
Computational complexity refers to the time required 
for a learning algorithm to estimate a solution from train- 
ing patterns. Many existing learning algorithms have high 
computational complexity. Designing efficient algorithms 
for neural network learning is a very active research topic. 
There are four basic types of learning rules: error- 
correction, Boltzmann, Hebbian, and competitive learning. 


ERROR-CORRECTION RULES. In the supervised learn- 
ing paradigm, the network is given a desired output for 
each input pattern. During the learning process, the actual 
output y generated by the network may not equal the 
desired output d. The basic principle of error-correction 
learning rules is to use the error signal (d — y) to modify 
the connection weights to gradually reduce this error.

The perceptron learning rule is based on this error-correction
principle. A perceptron consists of a single neuron
with adjustable weights, $w_j$, j = 1, 2, ..., n, and threshold
u, as shown in Figure 2. Given an input vector $x = (x_1, x_2, \ldots, x_n)$,
the net input to the neuron is

    $v = \sum_{j=1}^{n} w_j x_j - u.$

The output y of the perceptron is +1 if v > 0, and 0 otherwise.
In a two-class classification problem, the perceptron
assigns an input pattern to one class if y = 1, and to
the other class if y = 0. The linear equation

    $\sum_{j=1}^{n} w_j x_j - u = 0$

defines the decision boundary (a hyperplane in the
n-dimensional input space) that halves the space.
Rosenblatt developed a learning procedure to determine
the weights and threshold in a perceptron, given a
set of training patterns (see the “Perceptron learning algorithm”
sidebar).
Figure 5. Orientation selectivity of a single neuron
trained using the Hebbian rule.

Note that learning occurs only when the perceptron
makes an error. Rosenblatt proved that when training patterns
are drawn from two linearly separable classes, the
perceptron learning procedure converges after a finite
number of iterations. This is the perceptron convergence
theorem. In practice, you do not know whether the patterns
are linearly separable. Many variations of this learning
algorithm have been proposed in the literature. Other
activation functions that lead to different learning characteristics
can also be used. However, a single-layer perceptron
can only separate linearly separable patterns as long
as a monotonic activation function is used.
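The error-correction idea can be sketched for a single perceptron as follows. The sidebar's exact procedure is not reproduced in this section, so the update used here, w_j <- w_j + eta (d - y) x_j with the threshold folded in as w_0 = -u, is the commonly stated form and should be read as an assumption; the function names, learning rate, and epoch limit are likewise illustrative.

```python
def predict(weights, x):
    """weights[0] plays the role of w_0 = -u; the constant input x_0 = 1 is implicit."""
    v = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if v > 0 else 0


def train_perceptron(patterns, eta=1.0, max_epochs=100):
    """Update the weights only when the perceptron makes an error."""
    n = len(patterns[0][0])
    weights = [0.0] * (n + 1)
    for _ in range(max_epochs):
        errors = 0
        for x, d in patterns:
            y = predict(weights, x)
            if y != d:
                errors += 1
                weights[0] += eta * (d - y)              # threshold weight
                for j, xj in enumerate(x):
                    weights[j + 1] += eta * (d - y) * xj
        if errors == 0:                                   # every pattern is correct
            break
    return weights


# A linearly separable two-class problem: logical AND.
AND_PATTERNS = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
```

Calling `train_perceptron(AND_PATTERNS)` returns weights that classify all four patterns correctly. On a problem that is not linearly separable, such as XOR, the loop would simply run until `max_epochs` is reached.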

The back-propagation learning algorithm (see the
“Back-propagation algorithm” sidebar) is also based on
the error-correction principle.


BOLTZMANN LEARNING. Boltzmann machines are symmetric
recurrent networks consisting of binary units (+1
for “on” and -1 for “off”). By symmetric, we mean that the
weight on the connection from unit i to unit j is equal to the
weight on the connection from unit j to unit i ($w_{ij} = w_{ji}$). A
subset of the neurons, called visible, interact with the environment;
the rest, called hidden, do not. Each neuron is a
stochastic unit that generates an output (or state) according
to the Boltzmann distribution of statistical mechanics.
Boltzmann machines operate in two modes: clamped, in
which visible neurons are clamped onto specific states determined
by the environment; and free-running, in which both
visible and hidden neurons are allowed to operate freely.

Boltzmann learning is a stochastic learning rule derived
from information-theoretic and thermodynamic principles.
The objective of Boltzmann learning is to adjust the
connection weights so that the states of visible units satisfy
a particular desired probability distribution. According to
the Boltzmann learning rule, the change in the connection
weight $w_{ij}$ is given by

    $\Delta w_{ij} = \eta (\bar{\rho}_{ij} - \rho_{ij}),$

where $\eta$ is the learning rate, and $\bar{\rho}_{ij}$ and $\rho_{ij}$ are the correlations
between the states of units i and j when the network
operates in the clamped mode and free-running
mode, respectively. The values of $\bar{\rho}_{ij}$ and $\rho_{ij}$ are usually estimated
from Monte Carlo experiments, which are
extremely slow.

Boltzmann learning can be viewed as a special case of 
error-correction learning in which error is measured not 
as the direct difference between desired and actual out- 
puts, but as the difference between the correlations among 
the outputs of two neurons under clamped and free- 
running operating conditions. 
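The weight update itself is simple once the two sets of correlations are available. The sketch below assumes that samples of unit states (+1 or -1) have already been collected in the clamped and free-running modes; the Monte Carlo sampling that produces them is not shown, and the function names are illustrative.

```python
def correlations(samples):
    """Average product of the states of units i and j over the given samples."""
    n = len(samples[0])
    count = len(samples)
    return [[sum(s[i] * s[j] for s in samples) / count for j in range(n)]
            for i in range(n)]


def boltzmann_update(weights, clamped_samples, free_samples, eta=0.1):
    """Return new weights w_ij + eta * (clamped correlation - free correlation)."""
    rho_clamped = correlations(clamped_samples)
    rho_free = correlations(free_samples)
    n = len(weights)
    return [[0.0 if i == j   # no self-connections (assumed)
             else weights[i][j] + eta * (rho_clamped[i][j] - rho_free[i][j])
             for j in range(n)]
            for i in range(n)]
```

Because the correlation matrices are symmetric, the update keeps a symmetric weight matrix symmetric.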


HEBBIAN RULE. The oldest learning rule is Hebb’s postulate
of learning. Hebb based it on the following observation
from neurobiological experiments: If neurons on
both sides of a synapse are activated synchronously and
repeatedly, the synapse’s strength is selectively increased.

Mathematically, the Hebbian rule can be described as

    $w_{ij}(t + 1) = w_{ij}(t) + \eta\, y_j(t)\, x_i(t),$

where $x_i$ and $y_j$ are the output values of neurons i and j,
respectively, which are connected by the synapse $w_{ij}$, and $\eta$
is the learning rate. Note that $x_i$ is the input to the synapse.

An important property of this rule is that learning is 
done locally, that is, the change in synapse weight depends 
only on the activities of the two neurons connected by it. 
This significantly simplifies the complexity of the learning 
circuit in a VLSI implementation. 

A single neuron trained using the Hebbian rule exhibits
an orientation selectivity. Figure 5 demonstrates this property.
The points depicted are drawn from a two-dimensional
Gaussian distribution and used for training a neuron.
The weight vector of the neuron is initialized to $w_0$, as
shown in the figure. As the learning proceeds, the weight
vector moves progressively closer to the direction w of
maximal variance in the data. In fact, w is the eigenvector
of the covariance matrix of the data corresponding to the
largest eigenvalue.
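The orientation selectivity can be reproduced in a small simulation. This is a sketch under stated assumptions: the data are zero-mean two-dimensional Gaussian samples stretched along the direction (1, 1), the neuron is linear, and the weight vector is renormalized to unit length after every update. The renormalization is an addition made here because the plain Hebbian rule lets the weight magnitude grow without bound; it does not change the direction the weights turn toward.

```python
import math
import random


def make_samples(count=2000, seed=0):
    """Zero-mean 2D Gaussian data with most variance along (1, 1)."""
    rng = random.Random(seed)
    s = 1.0 / math.sqrt(2.0)
    samples = []
    for _ in range(count):
        a = rng.gauss(0.0, 3.0)   # large spread along (1, 1)
        b = rng.gauss(0.0, 0.5)   # small spread along (-1, 1)
        samples.append((s * (a - b), s * (a + b)))
    return samples


def hebbian_direction(samples, eta=0.01, w_init=(1.0, 0.0)):
    w = list(w_init)
    for x in samples:
        y = w[0] * x[0] + w[1] * x[1]                    # neuron output
        w = [w[k] + eta * y * x[k] for k in range(2)]    # Hebbian update
        norm = math.hypot(w[0], w[1])
        w = [w[k] / norm for k in range(2)]              # assumed renormalization
    return w
```

After training, the weight vector should lie close to the direction (1, 1) or its negative, which is the principal eigenvector of the covariance matrix of these samples.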


COMPETITIVE LEARNING RULES. Unlike Hebbian learn- 
ing (in which multiple output units can be fired simulta- 
neously), competitive-learning output units compete 
among themselves for activation. As a result, only one out- 
put unit is active at any given time. This phenomenon is 
known as winner-take-all. Competitive learning has been 
found to exist in biological neural networks. 

Competitive learning often clusters or categorizes the 
input data. Similar patterns are grouped by the network 
and represented by a single unit. This grouping is done 
automatically based on data correlations. 

The simplest competitive learning network consists of a
single layer of output units as shown in Figure 4. Each output
unit i in the network connects to all the input units ($x_j$’s)
via weights $w_{ij}$, j = 1, 2, ..., n. Each output unit also connects
to all other output units via inhibitory weights but has
a self-feedback with an excitatory weight. As a result of competition,
only the unit $i^*$ with the largest (or the smallest)
net input becomes the winner, that is, $w_{i^*} \cdot x \ge w_i \cdot x$, $\forall i$, or
$\| w_{i^*} - x \| \le \| w_i - x \|$, $\forall i$. When all the weight vectors are
normalized, these two inequalities are equivalent.
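The equivalence can be checked by expanding the squared distance. This is a short worked derivation, assuming every weight vector has unit length, $\| w_i \| = 1$:

```latex
\| w_i - x \|^2 = \| w_i \|^2 - 2\, w_i \cdot x + \| x \|^2
                = 1 + \| x \|^2 - 2\, w_i \cdot x .
```

Because $1 + \| x \|^2$ is the same for every unit, the unit with the largest $w_i \cdot x$ is also the unit with the smallest $\| w_i - x \|$.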

A simple competitive learning rule can be stated as

    $\Delta w_{ij} = \begin{cases} \eta (x_j - w_{i^* j}), & i = i^*, \\ 0, & i \ne i^*. \end{cases}$    (1)


Note that only the weights of the winner unit get updated. 
The effect of this learning rule is to move the stored pat- 
tern in the winner unit (weights) a little bit closer to the 
input pattern. Figure 6 demonstrates a geometric inter- 
pretation of competitive learning. In this example, we 
assume that all input vectors have been normalized to have 
unit length. They are depicted as black dots in Figure 6. 
The weight vectors of the three units are randomly ini- 
tialized. Their initial and final positions on the sphere after 
competitive learning are marked as Xs in Figures 6a and 
6b, respectively. In Figure 6, each of the three natural 
groups (clusters) of patterns has been discovered by an 
output unit whose weight vector points to the center of 
gravity of the discovered group. 
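A winner-take-all update of this kind can be sketched in a few lines. The example assumes the winner is chosen by smallest Euclidean distance and that each weight vector is initialized at one of the data points, a common precaution against units that never win; the data, initial weights, learning rate, and function names are illustrative choices.

```python
def winner(prototypes, x):
    """Index of the unit whose weight vector is nearest to x (Euclidean)."""
    def dist2(w):
        return sum((wk - xk) ** 2 for wk, xk in zip(w, x))
    return min(range(len(prototypes)), key=lambda i: dist2(prototypes[i]))


def competitive_learning(data, prototypes, eta=0.1, epochs=50):
    prototypes = [list(w) for w in prototypes]
    for _ in range(epochs):
        for x in data:
            w = prototypes[winner(prototypes, x)]
            for j in range(len(w)):
                w[j] += eta * (x[j] - w[j])   # only the winner moves toward x
    return prototypes


# Two well-separated groups of points (illustrative data).
DATA = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11)]
INITIAL = [(0, 0), (11, 11)]
```

With a constant learning rate the weight vectors keep moving slightly on every presentation, but each one stays inside the group of points it represents.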

You can see from the competitive learning rule that the
network will not stop learning (updating weights) unless
the learning rate $\eta$ is 0. A particular input pattern can fire
different output units at different iterations during learning.
This brings up the stability issue of a learning system.
The system is said to be stable if no pattern in the training
data changes its category after a finite number of learning
iterations. One way to achieve stability is to force the learning
rate to decrease gradually toward 0 as the learning process
proceeds. However, this artificial freezing of learning
causes another problem termed plasticity, which is the ability
to adapt to new data. This is known as Grossberg’s
stability-plasticity dilemma in competitive learning.





Figure 6. An example of competitive learning: (a)
before learning; (b) after learning.


The most well-known example of competitive learning
is vector quantization for data compression. It has been
widely used in speech and image processing for efficient
storage, transmission, and modeling. Its goal is to represent
a set or distribution of input vectors with a relatively
small number of prototype vectors (weight vectors), or a
codebook. Once a codebook has been constructed and
agreed upon by both the transmitter and the receiver, you
need only transmit or store the index of the prototype
corresponding to the input vector. Given an input vector, its
corresponding prototype can be found by searching for the
nearest prototype in the codebook.
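Encoding and decoding with a codebook can be sketched as a nearest-prototype search. The three-entry codebook and the function names below are hypothetical, chosen only to illustrate the idea.

```python
def encode(codebook, x):
    """Return the index of the prototype nearest to x (Euclidean distance)."""
    def dist2(p):
        return sum((pk - xk) ** 2 for pk, xk in zip(p, x))
    return min(range(len(codebook)), key=lambda k: dist2(codebook[k]))


def decode(codebook, index):
    """Return the prototype that stands in for the original vector."""
    return codebook[index]


# A hypothetical codebook shared by transmitter and receiver.
CODEBOOK = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
```

Only the integer index is transmitted or stored, so the compression gain grows with the dimension of the vectors and shrinks with the size of the codebook.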


SUMMARY. Table 2 summarizes various learning algorithms
and their associated network architectures (this
is not an exhaustive list). Both supervised and unsupervised
learning paradigms employ learning rules based
on error-correction, Hebbian, and competitive learning.
Learning rules based on error-correction can be used for
training feed-forward networks, while Hebbian learning
rules have been used for all types of network architectures.
However, each learning algorithm is designed for
training a specific architecture. Therefore, when we discuss
a learning algorithm, a particular network architecture
is implied. Each algorithm can
perform only a few tasks well. The last column of Table
2 lists the tasks that each algorithm can perform. Due to
space limitations, we do not discuss some other algorithms,
including Adaline, Madaline, linear discriminant
analysis, Sammon’s projection, and principal
component analysis. Interested readers can consult the
corresponding references (this article does not always
cite the first paper proposing the particular algorithms).

Figure 7. A typical three-layer feed-forward network architecture.

Figure 8. A geometric interpretation of the role of hidden unit in a two-dimensional input space.


MULTILAYER FEED-FORWARD NETWORKS

Figure 7 shows a typical three-layer perceptron. In general,
a standard L-layer feed-forward network (we adopt
the convention that the input nodes are not counted as a
layer) consists of an input stage, (L - 1) hidden layers, and
an output layer of units successively connected (fully or
locally) in a feed-forward fashion with no connections
between units in the same layer and no feedback connections
between layers.


Multilayer perceptron 

The most popular class of multilayer feed-forward net- 
works is multilayer perceptrons in which each computa- 
tional unit employs either the thresholding function or the 
sigmoid function. Multilayer perceptrons can form arbi- 
trarily complex decision boundaries and represent any 
Boolean function.* The development of the back-propa- 
gation learning algorithm for determining weights in a 
multilayer perceptron has made these networks the most 
popular among researchers and users of neural networks. 

We denote $w_{ij}^{(l)}$ as the weight on the connection between
the ith unit in layer (l-1) to jth unit in layer l.

Let $\{(x^{(1)}, d^{(1)}), (x^{(2)}, d^{(2)}), \ldots, (x^{(p)}, d^{(p)})\}$ be a set of p
training patterns (input-output pairs), where $x^{(i)} \in R^n$ is
the input vector in the n-dimensional pattern space, and
$d^{(i)} \in [0, 1]^m$, an m-dimensional hypercube. For classification
purposes, m is the number of classes. The squared-error
cost function most frequently used in the ANN
literature is defined as

    $E = \frac{1}{2} \sum_{i=1}^{p} \| y^{(i)} - d^{(i)} \|^2,$    (2)

where $y^{(i)}$ denotes the network output for input $x^{(i)}$.








The back-propagation algorithm is a gradient-descent
method to minimize the squared-error cost function in 
Equation 2 (see “Back-propagation algorithm” sidebar). 
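As a small illustration of the quantity being minimized, the sketch below evaluates a squared-error cost of the usual form, one half of the summed squared differences between network outputs and desired outputs over all training patterns. That form and the function name are assumptions made here; the network that produces the outputs is not modeled.

```python
def squared_error_cost(outputs, targets):
    """Half the sum of squared differences over all patterns and output units."""
    total = 0.0
    for y, d in zip(outputs, targets):
        total += sum((yk - dk) ** 2 for yk, dk in zip(y, d))
    return 0.5 * total
```

For two patterns with outputs (1, 0) and (0.5, 0.5) and desired outputs (1, 0) and (0, 1), the first pattern contributes nothing and the second contributes 0.25 + 0.25, giving a cost of 0.25. Gradient descent adjusts the weights in the direction that lowers this value.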

A geometric interpretation (adopted and modified from
Lippmann) shown in Figure 8 can help explicate the role
of hidden units (with the threshold activation function).

Each unit in the first hidden layer forms a hyperplane 
in the pattern space; boundaries between pattern classes 
can be approximated by hyperplanes. A unit in the sec- 
ond hidden layer forms a hyperregion from the outputs 
of the first-layer units; a decision region is obtained by
performing an AND operation on the hyperplanes. The 
output-layer units combine the decision regions made by 
the units in the second hidden layer by performing logi- 
cal OR operations. Remember that this scenario is 
depicted only to explain the role of hidden units. Their 
actual behavior, after the network is trained, could differ. 
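The hyperplane, AND, OR reading of the layers can be made concrete with a hand-wired network of threshold units. This is a sketch for a two-dimensional input space in which the weights are set by hand rather than learned; the two rectangular decision regions and all names are illustrative assumptions.

```python
def unit(weights, threshold, inputs):
    """Threshold unit: 1 if the weighted sum exceeds the threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total - threshold > 0 else 0


def half_planes(box):
    """First hidden layer: four lines bounding an axis-aligned box."""
    x_min, x_max, y_min, y_max = box
    return [((1, 0), x_min),     # fires when x > x_min
            ((-1, 0), -x_max),   # fires when x < x_max
            ((0, 1), y_min),     # fires when y > y_min
            ((0, -1), -y_max)]   # fires when y < y_max


def classify(point, boxes):
    regions = []
    for box in boxes:
        first = [unit(w, u, point) for w, u in half_planes(box)]
        # Second hidden layer: AND of the four half-planes (all must fire).
        regions.append(unit((1, 1, 1, 1), 3.5, first))
    # Output layer: OR of the decision regions (at least one must fire).
    return unit((1,) * len(regions), 0.5, regions)


# Two disjoint rectangular regions: (x_min, x_max, y_min, y_max).
BOXES = [(0, 1, 0, 1), (2, 3, 0, 1)]
```

Here `classify((0.5, 0.5), BOXES)` and `classify((2.5, 0.5), BOXES)` return 1, while a point between the boxes such as (1.5, 0.5) returns 0, so the network represents a class made of two disconnected regions.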

A two-layer network can form more complex decision
boundaries than those shown in Figure 8. Moreover, multilayer
perceptrons with sigmoid activation functions can
form smooth decision boundaries rather than piecewise
linear boundaries.


Radial Basis Function network 
The Radial Basis Function (RBF) network, which has
two layers, is a special class of multilayer feed-forward networks.
Each unit in the hidden layer employs a radial basis
function, such as a Gaussian kernel, as the activation function.
The radial basis function (or kernel function) is centered
at the point specified by the weight vector associated
with the unit. Both the positions and the widths of these
kernels must be learned from training patterns. There are
usually many fewer kernels in the RBF network than there
are training patterns. Each output unit implements a linear
combination of these radial basis functions. From the
point of view of function approximation, the hidden units
provide a set of functions that constitute a basis set for representing
input patterns in the space spanned by the hidden
units.

There are a variety of learning algorithms for the RBF network.
The basic one employs a two-step learning strategy, or hybrid
learning. It estimates kernel positions and kernel widths using
an unsupervised clustering algorithm, followed by a supervised
least mean square (LMS) algorithm to determine the connection
weights between the hidden layer and the output layer. Because
the output units are linear, a noniterative algorithm can be
used. After this initial solution is obtained, a supervised
gradient-based algorithm can be used to refine the network
parameters.

This hybrid learning algorithm for training the RBF network
converges much faster than the back-propagation algorithm for
training multilayer perceptrons. However, for many problems,
the RBF network often involves a larger number of hidden units.
This implies that the runtime (after training) speed of the RBF
network is often slower than the runtime speed of a multilayer
perceptron. The efficiencies (error versus network size) of the
RBF network and the multilayer perceptron are, however,
problem-dependent. It has been shown that the RBF network has
the same asymptotic approximation power as a multilayer
perceptron.
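
The two-step hybrid strategy described above can be sketched in
a few lines of Python. This is only an illustration under
stated assumptions, not the authors' implementation: plain
k-means stands in for the unspecified clustering algorithm,
each kernel width is set to the distance to the nearest other
center (a common heuristic, assumed here), and all function
names, the learning rate, and the sine-curve data are
hypothetical.

```python
import math

def kmeans_1d(xs, k, iters=20):
    """Unsupervised step: place k kernel centers with plain k-means."""
    srt = sorted(xs)
    # Deterministic start: k roughly evenly spaced samples (assumption).
    centers = [srt[(2 * j + 1) * len(srt) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            groups[nearest].append(x)
        # Empty clusters keep their previous center.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

def kernel_widths(centers):
    """Assumed heuristic: width = distance to the nearest other center."""
    return [min(abs(c - d) for m, d in enumerate(centers) if m != j)
            for j, c in enumerate(centers)]

def hidden_outputs(x, centers, widths):
    """Gaussian kernel activations plus a constant bias input."""
    acts = [math.exp(-((x - c) ** 2) / (2.0 * s * s))
            for c, s in zip(centers, widths)]
    return acts + [1.0]

def train_output_lms(xs, ys, centers, widths, rate=0.05, epochs=500):
    """Supervised step: LMS (delta rule) on the linear output weights."""
    weights = [0.0] * (len(centers) + 1)
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            h = hidden_outputs(x, centers, widths)
            err = y - sum(w * a for w, a in zip(weights, h))
            weights = [w + rate * err * a for w, a in zip(weights, h)]
    return weights

def mse(xs, ys, centers, widths, weights):
    total = 0.0
    for x, y in zip(xs, ys):
        h = hidden_outputs(x, centers, widths)
        total += (y - sum(w * a for w, a in zip(weights, h))) ** 2
    return total / len(xs)

# Hypothetical data: 21 samples of one period of a sine curve.
xs = [2.0 * math.pi * i / 20 for i in range(21)]
ys = [math.sin(x) for x in xs]
centers = kmeans_1d(xs, k=4)
widths = kernel_widths(centers)
weights = train_output_lms(xs, ys, centers, widths)
error_before = mse(xs, ys, centers, widths, [0.0] * 5)
error_after = mse(xs, ys, centers, widths, weights)
```

Only the output weights are trained in the supervised step, so
each update is a simple linear correction. The gradient-based
refinement of centers and widths mentioned in the text is left
out of this sketch.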







Issues 
There are many issues in designing feed-forward networks,
including

• how many layers are needed for a given task,
• how many units are needed per layer,
• how will the network perform on data not included in the
  training set (generalization ability), and
• how large the training set should be for “good”
  generalization.


Although multilayer feed-forward networks using back-propagation
have been widely employed for classification and function
approximation, many design parameters still must be determined
by trial and error. Existing theoretical results provide only
very loose guidelines for selecting these parameters in
practice.


KOHONEN'S SELF-ORGANIZING MAPS
The self-organizing map (SOM) has the desirable property of
topology preservation, which captures an important aspect of
the feature maps in the cortex of highly developed animal
brains. In a topology-preserving mapping, nearby input patterns
should activate nearby output units on the map. Figure 4 shows
the basic network architecture of Kohonen’s SOM. It basically
consists of a two-dimensional array of units, each connected to
all n input nodes. Let $w_{ij}$ denote the n-dimensional vector
associated with the unit at location (i, j) of the 2D array.
Each neuron computes the Euclidean distance between the input
vector x and the stored weight vector $w_{ij}$.
This SOM is a special type of competitive learning network that
defines a spatial neighborhood for each output unit. The shape
of the local neighborhood can be square, rectangular, or
circular. Initial neighborhood size is often set to one half to
two thirds of the network size and shrinks over time according
to a schedule (for example, an exponentially decreasing
function). During competitive learning, all the weight vectors
associated with the winner and its neighboring units are
updated (see the “SOM learning algorithm” sidebar).
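
As a rough illustration of the competitive update just
described, the following Python sketch trains a small 2D map.
It is not the sidebar algorithm itself: the square
neighborhood, the exponential decay constant, the initial
learning rate, and the names train_som and find_winner are all
assumptions chosen for brevity.

```python
import math
import random

def find_winner(w, x):
    """Return the grid location (i, j) whose weight vector is closest to x."""
    best, best_d = None, float("inf")
    for i, row in enumerate(w):
        for j, unit in enumerate(row):
            d = sum((xk - wk) ** 2 for xk, wk in zip(x, unit))  # squared Euclidean
            if d < best_d:
                best, best_d = (i, j), d
    return best

def train_som(data, rows, cols, epochs=30, rate0=0.5, seed=0):
    rng = random.Random(seed)
    n = len(data[0])
    # w[i][j] is the n-dimensional weight vector of the unit at (i, j).
    w = [[[rng.random() for _ in range(n)] for _ in range(cols)]
         for _ in range(rows)]
    radius0 = max(rows, cols) / 2.0  # start at one half of the network size
    for t in range(epochs):
        # Exponentially decreasing schedules (illustrative constants).
        decay = math.exp(-3.0 * t / epochs)
        radius = radius0 * decay
        rate = rate0 * decay
        for x in data:
            wi, wj = find_winner(w, x)
            for i in range(rows):
                for j in range(cols):
                    # Square neighborhood centered on the winner.
                    if max(abs(i - wi), abs(j - wj)) <= radius:
                        w[i][j] = [wk + rate * (xk - wk)
                                   for wk, xk in zip(w[i][j], x)]
    return w

# Hypothetical input: the four corners of the unit square.
data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
som = train_som(data, rows=3, cols=3)
winner = find_winner(som, [0.0, 0.0])
```

Each update moves a weight vector part of the way toward the
input, so the weights stay inside the range spanned by the
data and the initial values.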
Kohonen’s SOM can be used for projection of multivariate data,
density approximation, and clustering. It has been successfully
applied in the areas of speech recognition, image processing,
robotics, and process control. The design parameters include
the dimensionality of the neuron array, the number of neurons
in each dimension, the shape of the neighborhood, the shrinking
schedule of the neighborhood, and the learning rate.


ADAPTIVE RESONANCE THEORY MODELS
Recall that the stability-plasticity dilemma is an important
issue in competitive learning. How do we learn new things
(plasticity) and yet retain the stability to ensure that
existing knowledge is not erased or corrupted? Carpenter and
Grossberg’s Adaptive Resonance Theory models (ART1, ART2, and
ARTMap) were developed in an attempt to overcome this dilemma.
The network has a sufficient supply of output units, but they
are not used until deemed necessary. A unit is said to be
committed (uncommitted) if it is (is not) being used. The
learning algorithm updates the stored prototypes of a category
only if the input vector is sufficiently similar to them. An
input vector and a stored prototype are said to resonate when
they are sufficiently similar. The extent of similarity is
controlled by a vigilance parameter, ρ, with 0 < ρ < 1, which
also determines the number of categories. When the input vector
is not sufficiently similar to any existing prototype in the
network, a new category is created, and an uncommitted unit is
assigned to it with the input vector as the initial prototype.
If no such uncommitted unit exists, a novel input generates no
response.

We present only ART1, which takes binary (0/1) input, to
illustrate the model. Figure 9 shows a simplified diagram of
the ART1 architecture. It consists of two layers of fully
connected units. A top-down weight vector $w_j$ is associated
with unit j in the input layer, and a bottom-up weight vector
$\bar{w}_i$ is associated with output unit i; $\bar{w}_i$ is
the normalized version of $w_i$:

$$\bar{w}_i = \frac{w_i}{\varepsilon + \sum_j w_{ji}} \qquad (3)$$

where $\varepsilon$ is a small number used to break the ties in
selecting the winner. The top-down weight vectors $w_j$ store
cluster prototypes. The role of normalization is to prevent
prototypes with a long vector length from dominating prototypes
with a short one. Given an n-bit input vector x, the output of
the auxiliary unit A is given by


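$$A = \mathrm{Sgn}_{0/1}\Big(\sum_j x_j - n \sum_i O_i - 0.5\Big),$$

where $\mathrm{Sgn}_{0/1}(x)$ is the signum function that
produces +1 if x > 0 and 0 otherwise, and the output of an
input unit is given by

$$V_j = \begin{cases} x_j, & \text{if no output } O_i \text{ is "on"}, \\ x_j \wedge \sum_i w_{ji} O_i, & \text{otherwise.} \end{cases}$$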


A reset signal R is generated only when the similarity is 
less than the vigilance level. (See the “ART1 learning algo- 
rithm” sidebar.) 

The ART1 model can create new categories and reject 
an input pattern when the network reaches its capacity. 
However, the number of categories discovered in the input 
data by ART1 is sensitive to the vigilance parameter. 
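
The behavior described in this section (resonance, reset,
committing new units, and rejection at capacity) can be
imitated by a short Python sketch. It is a simplified stand-in
for the sidebar algorithm, not a reproduction of it: the
similarity measure (overlap divided by the number of 1s in the
input), the prototype update by logical AND, and the names
art1_cluster, max_units, and eps are assumptions.

```python
def art1_cluster(patterns, vigilance, max_units, eps=0.5):
    """Simplified ART1-style pass over binary patterns (lists of 0/1)."""
    prototypes = []   # top-down weight vectors of committed units
    labels = []       # category per input; None means the input was rejected
    for x in patterns:
        ones = sum(x)

        def activation(i):
            # Bottom-up match, normalized by prototype size as in Equation 3.
            w = prototypes[i]
            return sum(a & b for a, b in zip(w, x)) / (eps + sum(w))

        order = sorted(range(len(prototypes)), key=activation, reverse=True)
        chosen = None
        for i in order:
            overlap = sum(a & b for a, b in zip(prototypes[i], x))
            if ones > 0 and overlap / ones >= vigilance:
                # Resonance: refine the stored prototype.
                prototypes[i] = [a & b for a, b in zip(prototypes[i], x)]
                chosen = i
                break
            # Otherwise a reset occurs and the next-best unit is tried.
        if chosen is None and len(prototypes) < max_units:
            prototypes.append(list(x))   # commit an uncommitted unit
            chosen = len(prototypes) - 1
        labels.append(chosen)
    return labels, prototypes

# Hypothetical 6-bit patterns.
patterns = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
loose, _ = art1_cluster(patterns, vigilance=0.6, max_units=4)
strict, _ = art1_cluster(patterns, vigilance=0.9, max_units=4)
full, _ = art1_cluster(patterns, vigilance=0.6, max_units=1)
```

With these inputs the lower vigilance yields two categories,
the higher vigilance yields four, and a network limited to a
single unit rejects the last two patterns, which illustrates
the sensitivity to the vigilance parameter noted above.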


Figure 9. ART1 network. (Layers shown: competitive (output)
layer and comparison (input) layer.)

HOPFIELD NETWORK

Hopfield used a network energy function as a tool for designing
recurrent networks and for understanding their dynamic
behavior. Hopfield’s formulation made explicit the principle of
storing information as dynamically stable attractors and
popularized the use of recurrent networks for associative
memory and for solving combinatorial optimization problems.



A Hopfield network with n units has two versions: binary and
continuously valued. Let $v_i$ be the state or output of the
ith unit. For binary networks, $v_i$ is either +1 or -1, but
for continuous networks, $v_i$ can be any value between 0 and
1. Let $w_{ij}$ be the synapse weight on the connection from
unit i to unit j. In Hopfield networks, $w_{ij} = w_{ji}$,
∀i, j (symmetric networks), and $w_{ii} = 0$, ∀i (no
self-feedback connections). The network dynamics for the binary
Hopfield network are

$$v_i = \mathrm{Sgn}\Big(\sum_j w_{ij} v_j - \theta_i\Big) \qquad (4)$$

The dynamic update of network states in Equation 4 can
be carried out in at least two ways: synchronously and asyn- 
chronously. In a synchronous updating scheme, all units 
are updated simultaneously at each time step. A central 
clock must synchronize the process. An asynchronous 
updating scheme selects one unit at a time and updates its 
state. The unit for updating can be randomly chosen. 

The energy function of the binary Hopfield network in a state
$v = (v_1, v_2, \ldots, v_n)^T$ is given by

$$E = -\frac{1}{2} \sum_i \sum_j w_{ij} v_i v_j + \sum_i \theta_i v_i \qquad (5)$$

The central property of the energy function is that as network
state evolves according to the network dynamics (Equation 4),
the network energy always decreases and eventually reaches a
local minimum point (attractor) where the network stays with a
constant energy.
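
The link between the update rule and the energy can be checked
numerically. The Python sketch below applies asynchronous
updates to a small network and records the energy after every
step. It assumes the energy takes the standard form
E = -(1/2) sum_ij w_ij v_i v_j + sum_i theta_i v_i, and that
the sign function returns +1 for a zero argument. The weights,
thresholds, and function names are made up for illustration.

```python
import random

def sgn(x):
    # Assumed tie-breaking: a zero net input maps to +1.
    return 1 if x >= 0 else -1

def energy(w, theta, v):
    n = len(v)
    pair = sum(w[i][j] * v[i] * v[j] for i in range(n) for j in range(n))
    return -0.5 * pair + sum(theta[i] * v[i] for i in range(n))

def run_async(w, theta, v, sweeps=10, seed=0):
    """Asynchronous updating: one randomly chosen unit per step."""
    rng = random.Random(seed)
    v = list(v)
    n = len(v)
    trace = [energy(w, theta, v)]
    for _ in range(sweeps * n):
        i = rng.randrange(n)
        net = sum(w[i][j] * v[j] for j in range(n)) - theta[i]
        v[i] = sgn(net)
        trace.append(energy(w, theta, v))
    return v, trace

# Hypothetical symmetric weights with a zero diagonal.
w = [
    [0, 1, -2, 1],
    [1, 0, 1, -1],
    [-2, 1, 0, 3],
    [1, -1, 3, 0],
]
theta = [0.5, 0.0, -0.5, 0.0]
final_state, trace = run_async(w, theta, [1, -1, 1, -1])
never_increases = all(b <= a for a, b in zip(trace, trace[1:]))
```

In this sketch the energy may stay constant on a step where the
chosen unit does not change state, so the check is that the
energy never increases from one step to the next.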


Associative memory

When a set of patterns is stored in these network attractors,
it can be used as an associative memory. Any pattern present in
the basin of attraction of a stored pattern can be used as an
index to retrieve it.

An associative memory usually operates in two phases: storage
and retrieval. In the storage phase, the weights in the network
are determined so that the attractors of the network memorize a
set of p n-dimensional patterns $\{x^1, x^2, \ldots, x^p\}$ to
be stored. A generalization of the Hebbian learning rule can be
used for setting connection weights $w_{ij}$. In the retrieval
phase, the input pattern is used as the initial state of the
network, and the network evolves according to its dynamics. A
pattern is produced (or retrieved) when the network reaches
equilibrium.
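
A small Python sketch can make the two phases concrete. It
assumes the simplest Hebbian rule (a sum of outer products with
the diagonal set to zero), zero thresholds, and sequential
asynchronous updates. The article does not specify the exact
rule, so the functions store and recall and the two 8-bit
patterns are illustrative only.

```python
def store(patterns):
    """Storage phase: Hebbian sum of outer products, no self-connections."""
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for x in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += x[i] * x[j]
    return w

def recall(w, probe, max_sweeps=20):
    """Retrieval phase: update units one at a time until nothing changes."""
    v = list(probe)
    n = len(v)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            net = sum(w[i][j] * v[j] for j in range(n))
            new = 1 if net >= 0 else -1
            if new != v[i]:
                v[i], changed = new, True
        if not changed:
            break   # equilibrium reached
    return v

# Two hypothetical orthogonal patterns with +1/-1 components.
a = [1, 1, 1, 1, -1, -1, -1, -1]
b = [1, -1, 1, -1, 1, -1, 1, -1]
w = store([a, b])

noisy_a = [-1] + a[1:]          # pattern a with its first bit flipped
retrieved = recall(w, noisy_a)
```

Here the corrupted probe lies in the basin of attraction of
pattern a, so the network settles on a, and both stored
patterns are themselves stable states.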

How many patterns can be stored in a network with n binary
units? In other words, what is the memory capacity of a
network? It is finite because a network with n binary units has
a maximum of $2^n$ distinct states, and not all of them are
attractors. Moreover, not all attractors (stable states) can
store useful patterns. Spurious attractors can also store
patterns different from those in the training set.

It has been shown that the maximum number of random patterns
that a Hopfield network can store is $p_{max} \approx 0.15n$.
When the number of stored patterns p < 0.15n, a nearly perfect
recall can be achieved. When memory patterns are orthogonal
vectors instead of random patterns, more patterns can be
stored. But the number of spurious attractors increases as p
reaches capacity. Several learning rules have been proposed for
increasing the memory capacity of Hopfield networks. Note that
we require $n^2$ connections in the network to store p n-bit
patterns.


Energy minimization 

Hopfield networks always evolve in the direction that leads to
lower network energy. This implies that if a combinatorial
optimization problem can be formulated as minimizing this
energy, the Hopfield network can be used to find the optimal
(or suboptimal) solution by letting the network evolve freely.
In fact, any quadratic objective function can be rewritten in
the form of Hopfield network energy. For example, the classic
Traveling Salesman Problem can be formulated as such a problem.
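
To see why a quadratic objective fits the energy form, consider
variables restricted to +1 and -1. The Python sketch below
converts a hypothetical objective f(v) = sum_ij q_ij v_i v_j +
sum_i c_i v_i into weights and thresholds and checks, by
enumerating every state, that f equals the energy plus a
constant. It assumes the energy form
E = -(1/2) sum_ij w_ij v_i v_j + sum_i theta_i v_i. The matrix
q, the vector c, and the function names are invented for the
example.

```python
from itertools import product

def quadratic(q, c, v):
    n = len(v)
    return (sum(q[i][j] * v[i] * v[j] for i in range(n) for j in range(n))
            + sum(c[i] * v[i] for i in range(n)))

def to_hopfield(q, c):
    """Map a quadratic objective over +1/-1 variables to (w, theta, const)."""
    n = len(c)
    # Symmetric weights with zero diagonal; diagonal terms of q are
    # constant because v_i * v_i = 1, so they go into `const`.
    w = [[0 if i == j else -(q[i][j] + q[j][i]) for j in range(n)]
         for i in range(n)]
    theta = list(c)
    const = sum(q[i][i] for i in range(n))
    return w, theta, const

def energy(w, theta, v):
    n = len(v)
    pair = sum(w[i][j] * v[i] * v[j] for i in range(n) for j in range(n))
    return -0.5 * pair + sum(theta[i] * v[i] for i in range(n))

q = [[2, 1, 0], [3, -1, 4], [0, -2, 5]]
c = [1, -3, 2]
w, theta, const = to_hopfield(q, c)
matches = all(quadratic(q, c, v) == energy(w, theta, v) + const
              for v in product((-1, 1), repeat=3))
```

Because the two functions differ only by a constant, a state
that lowers the network energy also lowers the objective.
Encoding constraints of a problem such as the Traveling
Salesman Problem as penalty terms is a separate step not shown
here.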


APPLICATIONS 


We have discussed a number of important ANN models and learning
algorithms proposed in the literature. They have been widely
used for solving the seven classes of problems described in the
beginning of this article. Table 2 showed typical suitable
tasks for ANN models and learning algorithms. Remember that to
successfully work with real-world problems, you must deal with
numerous design issues, including network model, network size,
activation function, learning parameters, and number of
training samples. We next discuss an optical character
recognition (OCR) application to illustrate how multilayer
feed-forward networks are successfully used in practice.

OCR deals with the problem of processing a scanned image of
text and transcribing it into machine-readable form. We outline
the basic components of OCR and explain how ANNs are used for
character classification.


An OCR system 

An OCR system usually consists of modules for preprocessing,
segmentation, feature extraction, classification, and
contextual processing. A paper document is scanned to produce a
gray-level or binary (black-and-white) image. In the
preprocessing stage, filtering is applied to remove noise, and
text areas are located and converted to a binary image using a
global or local adaptive thresholding method. In the
segmentation step, the text image is separated into individual
characters. This is a particularly difficult task with
handwritten text, which contains a proliferation of touching
characters. One effective technique is to break the composite
pattern into smaller patterns (over-segmentation) and find the
correct character segmentation points using the output of a
pattern classifier.

Because of various degrees of slant, skew, and noise level, and
various writing styles, recognizing segmented characters is not
easy. This is evident from Figure 10, which shows the
size-normalized character bitmaps of a sample set from the NIST
(National Institute of Standards and Technology) hand-print
character database.


Schemes

Figure 11 shows the two main schemes for using ANNs in an OCR
system. The first one employs an explicit feature extractor
(not necessarily a neural network). For instance, contour
direction features are used in Figure 11. The extracted
features are passed to the input stage of a multilayer
feed-forward network. This scheme is very flexible in
incorporating a large variety of features. The other scheme
does not explicitly extract features from the raw data. The
feature extraction implicitly takes place within the
intermediate stages (hidden layers) of the ANN. A nice property
of this scheme is that feature extraction and classification
are integrated and trained simultaneously to produce optimal
classification results. It is not clear whether the types of
features that can be extracted by this integrated architecture
are the most effective for character recognition. Moreover,
this scheme requires a much larger network than the first one.

A typical example of this integrated feature
extraction-classification scheme is the network developed by
Le Cun et al. for zip code recognition. A 16 x 16 normalized
gray-level image is presented to a feed-forward network with
three hidden layers. The units in the first layer are locally
connected to the units in the input layer, forming a set of
local feature maps. The second hidden layer is constructed in a
similar way. Each unit in the second layer also combines local
information coming from feature maps in the first layer.

The activation level of an output unit can 
be interpreted as an approximation of the 
a posteriori probability of the input pat- 
tern’s belonging to a particular class. The 
output categories are ordered according to 
activation levels and passed to the post- 
processing stage. In this stage, contextual 
information is exploited to update the clas- 
sifier’s output. This could, for example, 
involve looking up a dictionary of admissi- 
ble words, or utilizing syntactic constraints 
present, for example, in phone or social 
security numbers. 
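
The dictionary lookup mentioned above can be illustrated with a
toy Python sketch. It treats each output activation as an
approximate class probability and scores a dictionary word by
the product of the activations of its characters, which
assumes the character positions are independent. The function
best_word, the activation values, and the word list are all
hypothetical.

```python
def best_word(char_scores, dictionary):
    """Pick the admissible word with the highest combined activation.

    char_scores: one dict per character position, mapping a candidate
    character to the activation of the corresponding output unit.
    """
    best, best_score = None, 0.0
    for word in dictionary:
        if len(word) != len(char_scores):
            continue
        score = 1.0
        for ch, scores in zip(word, char_scores):
            score *= scores.get(ch, 0.0)   # unseen candidates score zero
        if score > best_score:
            best, best_score = word, score
    return best

# Made-up classifier outputs for a three-character word.
char_scores = [
    {"c": 0.6, "e": 0.4},
    {"o": 0.5, "a": 0.45},
    {"t": 0.9, "l": 0.1},
]
dictionary = ["cat", "ear", "dog", "coat"]

# Taking the top-ranked class at each position ignores context.
top_choice = "".join(max(s, key=s.get) for s in char_scores)
corrected = best_word(char_scores, dictionary)
```

In this example the per-character top choices spell a string
that is not in the word list, while the dictionary constraint
selects an admissible word that uses the second-ranked class
at one position.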





Results 

ANNs work very well in the OCR application. However, there is
no conclusive evidence about their superiority over
conventional statistical pattern classifiers. At the First
Census Optical Character Recognition System Conference held in
1992, more than 40 different handwritten character recognition
systems were evaluated based on their performance on a common
database. The top 10 performers used either some type of
multilayer feed-forward network or a nearest neighbor-based
classifier. ANNs tend to be superior in terms of speed and
memory requirements compared to nearest neighbor methods.
Unlike the nearest neighbor methods, classification speed using
ANNs is independent of the size of the training set. The
recognition accuracies of the top OCR systems on the NIST
isolated (presegmented) character data were above 98 percent
for digits, 96 percent for uppercase characters, and 87 percent
for lowercase characters. (Low recognition accuracy for
lowercase characters was largely due to the fact that the test
data differed significantly from the training data, as well as
being due to “ground-truth” errors.) One conclusion drawn from
the test is that OCR system performance on isolated characters
compares well with human performance. However, humans still
outperform OCR systems on unconstrained and cursive handwritten
documents.


Figure 11. Two schemes for using ANNs in an OCR system.

DEVELOPMENTS IN ANNS HAVE STIMULATED a lot of enthusiasm and
criticism. Some comparative studies are optimistic, some offer
pessimism. For many tasks, such as pattern recognition, no one
approach dominates the others. The choice of the best technique
should be driven by the given application’s nature. We should
try to understand the capacities, assumptions, and
applicability of various approaches and maximally exploit their
complementary advantages to develop better intelligent systems.
Such an effort may lead to a synergistic approach that combines
the strengths of ANNs with other technologies to achieve
significantly better performance for challenging problems. As
Minsky recently observed, the time has come to build systems
out of diverse components. Individual modules are important,
but we also need a good methodology for integration. It is
clear that communication and cooperative work between
researchers working in ANNs and other disciplines will not only
avoid repetitious work but (and more important) will stimulate
and benefit individual disciplines.